```cpp
// hmul
template <class A, class T, class /*=typename std::enable_if<std::is_integral<T>::value, void>::type*/>
XSIMD_INLINE T hmul(batch<T, A> const& self, requires_arch<common>) noexcept
```
Shouldn't this just be the generic implementation of reduce_mul? And if so, why not use the current generic reduction implementation?
This function is only used for integral types with sizeof(T) < 4. I implemented it this way because of what happens when testing reduce_mul for uint8_t on an AVX machine: the batch has 32 elements, so the generic reduce function multiplies and swizzles the batch 5 times (32->16->8->4->2->1) and finally calls self.get(0) (which is itself inefficient, see issue #1133). It seemed more efficient to do it differently: for uint8_t, my implementation first splits the batch into two equal sse4_2 batches, multiplies them together elementwise, and then calls hmul on the half-width result.
Also, the swizzle function for uint8_t just loads the batch into a buffer and stores the result back. Using that 5 times (on AVX) is far less efficient than the split_avx function, which splits the batch with SIMD intrinsics.
@serge-sans-paille you haven't responded yet. If you are open to the idea of this reduce_mul implementation, I can fix the code so that it passes all the tests. If you have feedback or recommendations on the code I have written, you can post it here and I can address it after we discuss.
I'm also interested in this. @emrys53 could you rebase on master?
It changes nothing for clang which has a nice shuffle optimizer, but it does help gcc a lot. Somehow related to #1132
While using XSIMD in our library, we realized that a dedicated reduce_mul function was missing. This PR introduces a reduce_mul implementation for all architectures and supported types, with architecture-specific optimizations for SSE, AVX, and AVX512 instruction sets. Relevant unit tests are included.
I was unable to test the AVX512-specific implementation on my machine, which only supports AVX2. For AVX512, I used the _mm512_reduce_mul_{type} intrinsics. According to Intel's documentation, these are sequential rather than parallel. Since I couldn't benchmark or compare alternative implementations, I left it as-is. I'm open to suggestions for improvement.
Motivation
Although XSIMD provides a general reduce function, using it with std::complex<float> or std::complex<double> fails, because as_unsigned_integer_t is not specialized for complex types. I'm not sure whether this is intended behavior, but I wanted to mention it for context. Providing a reduce_mul directly avoids this issue.
This is my first contribution to XSIMD, and I’ve only been using the library for about a week. Feedback is very welcome, and I’m happy to adjust or improve the code based on your review.